Functional and structural analysis of non-synonymous single nucleotide polymorphisms (nsSNPs) in the MYB oncoproteins associated with human cancer

MYB proteins are highly conserved DNA-binding domains (DBD) and mutations in MYB oncoproteins have been reported to cause aberrant and augmented cancer progression. Identification of MYB molecular biomarkers predictive of cancer progression can be used for improving cancer management. To address this, a biomarker discovery pipeline was employed in investigating deleterious non-synonymous single nucleotide polymorphisms (nsSNPs) in predicting damaging and potential alterations on the properties of proteins. The nsSNP of the MYB family; MYB, MYBL1, and MYBL2 was extracted from the NCBI database. Five in silico tools (PROVEAN, SIFT, PolyPhen-2, SNPs&GO and PhD-SNP) were utilized to investigate the outcomes of nsSNPs. A total of 45 nsSNPs were predicted as high-risk and damaging, and were subjected to PMut and I-Mutant 2.0 for protein stability analysis. This resulted in 32 nsSNPs with decreased stability with a DDG score lower than − 0.5, indicating damaging effect. G111S, N183S, G122S, and S178C located within the helix-turn-helix (HTH) domain were predicted to be conserved, further posttranslational modifications and 3-D protein analysis indicated these nsSNPs to shift DNA-binding specificity of the protein thus altering the protein function. Findings from this study would help in the field of pharmacogenomic and cancer therapy towards better intervention and management of cancer.

Identification of deleterious nsSNPs in MYB family genes. The nsSNPs were then subjected to five different tools (PROVEAN, SIFT, Polyphen2, SNPs&GO and PhD-SNP) that has different prediction algorithms to identify nsSNP with significant deleterious effects which could affect the biological structure and the function of MYB proteins. Forty-eight nsSNPs were identified as "pathogenic" or "damaging" by all tools, hence classified at "high-risk" ( Table 1).
Verification of high risk nsSNPs by PMut. The selected damaging nsSNPS were then submitted to PMut server to determine the probability score and the status of prediction of the resultant protein due to mutations. Table 2 shows the prediction scores and statuses. All MYBL2 nsSNPs were predicted as high-risk, whereas 16 nsSNPs and 14 nsSNPs from MYB and MYBL1 genes were also identified as "disease". The "disease" status indicates that the mutated proteins are predicted to be pathogenic.
Determination of protein structural stability by I-Mutant 2.0. The structural stability of resultant proteins was predicted by I-Mutant 2.0. The output of I-Mutant 2.0 was expressed in free energy change value (DDG) and reliability index (RI). In total, 41 nsSNPs were confirmed to cause decrease in stability to the resultant proteins, however only 32 nsSNPs were predicted to have a DDG value < − 0.5, indicating its greater impact towards the proteins. Evolutionary conservation analysis by ConSurf. The evolutionary conservation was determined through subjecting the mutated protein sequences to the ConSurf web server. A total of thirty-six nsSNPs of MYB family genes were identified as functional, highly conserved with exposed amino acid residues, whereas six nsSNPs were predicted as structural, highly conserved and buried. On the contrary, a total of four nsSNPs were predicted to be exposed but not functional, while two nsSNPs were predicted as buried and not functional ( Table 2).

Prediction of post-translational modification (PTM) sites. Post-translational modification (PTM)
refers to the process where proteins undergo chemical modification to become functional and participate in respective cellular activities 16 . Putative methylation sites in the MYB family proteins and the 30 high risk nsSNPs were predicted using MusiteDeep and GPS-MSP 1.0. Only the wild-type R554 in MYBL1 was predicted as the common site (Fig. 1). Phosphorylation sites in the native and 30 mutated proteins were predicted using NetPhos 3.1 and GPS 5.0. NetPhos 3.1 predicted 268 residues in the three proteins to have phosphorylation potentials. A total of 386 residues in the proteins were found to be phosphorylated using the GPS 5.0, and 261 phosphorylation sites in all proteins were identified using both the tools, which consisted of 154 serines, 93 threonines, and 14 tyrosines (Fig. 1). However, only two wild-type reisudes (S175 and S178) and four mutant residues (S111, S183, S574, and S122) were the common sites present in the high risk nsSNPs. The ubiquitylation predictors employed in this study were BDM-PUB and UbiNet 2.0. BDM-PUB found 110 ubiquitylation sites at lysine residues in the family proteins. Whereas, only the wild-type K84 and K109 were predicted by UbiNet 2.0 in MYB. Both BDM-PUB and UbiNet 2.0 did not have common findings in all three proteins. After assembling the results from ConSurf and various PTM tools, a total of five high risk nsSNPs; G111S, N183S, P574S, G122S, and S178C containing putative phosphorylation sites were selected to proceed with comparative 3D modelling (Table 3).

Comparative modelling of wild-type MYB family proteins and their mutant structures.
To investigate if the five high risk nsSNPs substantially alter the resultant proteins, predictive 3D modelling was performed along with the structural comparisons between wild-type and mutant models. The c1h88C and c1mseC templates were used to predict the wild-type MYB family proteins and their mutant models, excluding the P574S structure as this mutant residue was not covered in either template. TM-align revealed all mutant models had values of TM-score = 0 and RMSD = 1, showing no structural variations from their wild-type forms (Table 4). SWISS-MODEL was used to construct the 3D models for the wild-type proteins and their mutants. The best  (Table 4). These models were visualised in Chimera 1.15 and the corresponding mutation positions induced by the nsSNPs were affirmed (Fig. 2). The structural integrity of generated wild-type and mutant protein structures were further validated with Ramachandran plot through the dihedral angles calculated. Wild-type and mutant PDB inputs were subjected to PROCHECK for analysis. The wildtype MYB has 120 residues (87.6%) in the most favored region, 17 residues (12.4%) in the additional allowed region. The more damaging mutants, G111S has 122 residues (88.4%) in the most favored region, 15 residues (10.9%) in the additional allowed region and 1 residue (0.7%) in the disallowed region, followed by N183S possesses 122 residues (89.1%) in the most favored region, 15 residues (10.9%) in the additional allowed region. In MYBL1, both G122S mutant and wildtype possess the same amino acid residue patterns, 123 residues (89.1%) in the most favoured region and 15 residues (10.9%) in the additional allowed region, indicating no significant changes in the alteration in the structure. Wildtype MYBL2 has 119 residues (86.2%) in the most favoured region and 19 residues (13.8%) in the additional allowed region. Mutant S178C in MYBL2 has 118 residues (85.5%) in the most favoured region and 20 residues (14.5%) in the additional allowed region as shown as in Table 4. Highly conserved and exposed (f)

Discussion
This study has successfully identified high-risk pathogenic nsSNPs in the MYB oncoproteins towards understanding its association with human cancer using an in-silico approach. MYB oncoproteins play crucial roles in multiple signalling pathways for cellular activities. A study by Andersson and colleagues 17 showed that overexpressed wild-type MYB genes are normally benign, however, overexpression accompanied by gene alteration, dysregulated gene rearrangement or the incorrect oncoprotein binding onto enhancer region could promote tumorigenesis 1 . Despite the mutations in MYB oncoproteins being reported frequently in numerous cancers, the precise mechanisms of tumour initiations and/or maintenance remains vague. Therefore, examining the outcomes of deleterious nsSNPs of the MYB oncoproteins could potentially pave ways into a better understanding thus revealing its deleterious effects. Therefore, the aim of this study is to develop a bioinformatics pipeline in determining the most damaging nsSNPs and their effects on the structure and function of MYB, MYBL1, and MYBL2 proteins. A total of 51,862 SNPs were extracted from the NCBI dbSNP for the MYB family genes, of which 1503 were nsSNPs. Structural analysis using (PROVEAN, PolyPhen-2, SIFT, SNPs&GO, and PhD-SNP) and functional analysis using PMut resulted in 45 "high-risk" pathogenic nsSNPs. Next, the stability of these nsSNPs were determined using I-Mutant 2.0, where 41 nsSNPs were identified with "decreased stability". Protein stability is one of the key features to determine if a protein is biologically active and functional 18 . Proteins with decreased stability due to mutation might give rise to tumorigenesis as the fitness level for normal proteins dropped and conferred the fitness for tumorigenic proteins 19 .
Evolutionary conservation of MYB protein residues were calculated using ConSurf. The evolutionary conservation of an amino acid indicates its natural tendency for mutation to take place and highly conserved and exposed amino acids that undergo mutations can be expected to be most deleterious 20 . Thirty-six nsSNPs were identified to be highly conserved and exposed by ConSurf with the conservation score of 9. Next, these nsSNPs were subjected to PTMs analysis to determine its effect on regulating functions and structures of proteins. The G111S, N183S, P574S, G122S, and S178C showed to harbour putative phosphorylation sites. These nsSNPs coinciding with the putative phosphorylation sites may cause functional impairment and destabilisation of the corresponding proteins, thereby enhancing PTM impairment. PTM plays a pivotal role in modulating various protein functions and expressions, therefore mutations in PTM sites could lead towards malfunctions of the protein' regulatory mechanisms, contributing to cellular dysfunctions such as transformation into cancer cells 21 . Several studies showed that mutated residues at phosphorylation sites had led to detrimental alterations in the expressions and functions of MYB 22,23 and MYBL2 [24][25][26] . Among the four nsSNPs, G111S mutation in MYB was Table 3. High risk nsSNPs selected by considering ConSurf and PTM predictions. b buried residue, s predicted structural residue (highly conserved and buried), e exposed residue, f predicted functional residue (highly conserved and exposed). a Highly variable (1), highly conserved (9).   www.nature.com/scientificreports/ reportedly associated with uterine leiomyosarcomas 27 . Analysis using methylation and ubiquitylation predictors were also performed on the selected high risk nsSNPs. Only the R554 consensus methylation site was identified in MYBL1. As the results of methylation and ubiquitylation sites prediction were not in agreement for MYB and MYBL2, it was considered that both PTM sites were not predicted in these two proteins. Structural alterations on the resulting proteins were constructed using the Phyre2 homology modelling tool. Protein 3-dimensional analysis offers a detailed insight into the associated molecular changes 28 . Two templates (c1h88C and c1mseC) were utilised to construct the protein models. These templates were selected based on high sequence similarities and high GMQE value, providing a high coverage. The TM-scores and RMSD values obtained for the four mutants suggest that the nsSNPs might not have a significant structural consequence on the proteins for TM-align to detect. Protein structure homology-modelling tool SWISS-MODEL was conducted to remodel the four nsSNPs for structure and function prediction. The template (1h88.1.C) was used as it has high sequence identity and desired coverage range. GMQE scores of the models were estimated to be around 0.15-0.16, indicating that the models cover only 15-16% of the targeted sequence. All models have QMEAN Z-scores greater than − 4.0, indicating the high quality of the models. Finally, ERRAT evaluation gives overall quality factors that approximately hundred to all models, indicating high quality models were built.

Protein SNP ID Mutation
The G111S and G122S were identified to be located in the helix-turn-helix (HTH) myb-type 2 domain and S178C and N183S in the HTH myb-type 3 domain. These domains are important for the binding of DNA sequences and gene expression. This could cause activation and overexpression of the gene through loss of the C-terminal negative regulatory domain (C-myb) 29 . Previous studies have also reported that these could lead towards loss of the 3′ UTR binding sites thus negatively regulating MYB mRNA stability and translation 30,31 .
Thus, the nsSNPs identified within these regions may have a complete shift in the DNA-binding specificity, resulting in a pathogenic protein synthesis 32 .    38 . Those nsSNPs which were predicted to be deleterious by all five in silico tools were considered as "high-risk" nsSNPs and selected for further downstream analysis 39 . This ensured the stringency and accuracy of the results by incorporating the scores of all five computational tools to increase the precision of prediction.

Methods
Validating the high risk nsSNPs. PMut [http:// mmb. irbba rcelo na. org/ PMut/] was resorted to validate the pathological nature of the selected high risk nsSNPs 40 . This neural network-based tool includes 27,203 harmful and 38,078 benign mutations for 12,141 proteins. Prediction score ranging from 0 to 1 was computed along with the prediction percentage. The nsSNPs with a score of ≤ 0.5 are classified as neutral, whereas those with > 0.5 are predicted as disease-associated 40 .
Determining protein stability. Protein stability of the nsSNPs were determined through I-Mutant 2.0 [https:// foldi ng. biofo ld. org/i-mutant/ i-mutan t2.0. html] 41 . This tool determines the increase decrease of stability change in mutated protein, and simultaneously estimates the corresponding values of free energy change (DDG). I-Mutant 2.0 uses a support vector machine method and a ProTherm-derived dataset, which is the most collective databank containing experimental thermodynamic data of free energy changes in mutated protein stability 41 . Along with these predictions, a reliability index (RI) ranging from 0 (lowest reliability) to 10 (highest reliability) was also computed by this web server.
Protein evolutionary conservation analysis. ConSurf [https:// consu rf. tau. ac. il/] was used to predict the evolutionary conservation of each residue position in the native MYB proteins 41 . The prediction is based on an empirical Bayesian algorithm and the phylogenetic relations between close homologous sequences. For each amino acid position, a colorimetric conservation score between 1 and 9 is calculated by the tool and then classified as either a variable (1-4), intermediately conserved (5-6), or highly conserved residue (7-9). The exposed (on protein surface) or buried (inside protein core) status of each residue position in the protein structure is also determined. A functional residue is predicted when it is highly conserved and exposed, whereas a structural residue is predicted if it is highly conserved and buried 20,42 .  44 . Using a default cut-off of 0.5, the deep learning-based MusiteDeep predicts and labels the desired PTM sites in the sequence according to the confidence threshold 43 . As for GPS-MSP 1.0, types of mono, symmetrical di-, and asymmetrical di-methylation specific to arginines, as well as mono, di-and tri-methylation types specific for lysines were predicted 45 49 were employed to predict putative protein ubiquitination sites at the lysines in MYB family proteins. A balanced cut-off option and a threshold of 0.3 were selected for the BDM-PUB server to perform the prediction based upon Bayesian Discriminant Method (BDM) 50 .

Post-translational modification sites prediction.
Examining the effects of nsSNPs with 3D protein modelling. The nsSNPs that were predicted as pathogenic, highly conserved with decreased protein stability, and possessing PTM sites were chosen to proceed with 3D protein modelling using 1h88.  53 was utilised to investigate the similarities between the modelled wild-type and mutant protein structures by computing template modelling-score (TM-score) and root-mean-square deviation (RMSD) values. TM-score yields a result from 0 to 1, where 1 denotes a perfect match between both structures 54 . Precisely, 0.0 < TM-score < 0.30 indicates random structural similarity, whereas 0.50 < TM-score < 1.00 implies that both structures are within the same fold 55,56 . A lower TM-score and a higher RMSD value indicate a greater structural deviation of mutant models from those of wildtype 57 . To build the 3D models in SWISS-MODEL, templates were analysed and selected based on coverage, sequence identity, qualitative model energy analysis (QMEAN) Z-score, and global model quality estimation (GMQE) score. A QMEAN Z-score of ≤ − 4.0 denotes a low quality model 58 . The GMQE score, which ranges from 0 to 1, indicates the likely accuracy of the model constructed with that alignment and the target coverage 59 . Therefore, templates with higher sequence similarities and a higher GMQE value were prioritized, concurrent with the coverage of the mutation site in that template, thus, template 1h88.

Conclusion
MYB family members are often aberrantly expressed in human cancers, suggesting that they could be important for tumour initiation and/or maintenance. In this study, a total of 30 nsSNPs were predicted as high-risk pathogenic, conserved with decreased stability, suggesting potential deleterious effect on the protein structure. Further PTM and 3D protein modeling indicated rs1361650612 (G111S), rs1179275735 (N183S), rs766676175 (G122S), and rs1438994955 (S178C) located within the helix-turn-helix (HTH) myb-type 2 and myb-type 3 domains were identified pathogenic with the ability to potentially cause great functional and stability impairment on the proteins. This study concise confidence that these findings could serve as a benchmark towards potential diagnostic and therapeutic interventions.

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.